A Statistical Investigation into the Geographic Characteristics of the DATA2X02 Cohort

Author

Harry Breden

Published

September 2, 2023

Code
# KNITR MUST BE VERSION 1.42 TO RENDER MAPS

#Library Imports
library('tidyverse')
library('gendercoder')
library('janitor')
library("scales")
library("sf")
library('ggmap')
library('plotly')
library('leaflet')
library('tippy')
library('xfun')
library('stringr')
library('kableExtra')
library('ggpubr')
library('flextable')
library("stringdist")
Code
#Needed to clean names for the inline code. More involved cleaning will be discussed.
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv') |>
  janitor::clean_names()

1 Introduction

Code
tippy::tippy_this(elementId = "random_sample", tooltip = "When all members of a population have equal likelihood to be sampled.")
Code
tippy::tippy_this(elementId = "wam", tooltip = "Weighted Average Mark")

DATA2X02 is a group of two units – DATA2002 and DATA2902 – offered within the School of Mathematics and Statistics at The University of Sydney. The units teach “advanced data analytic skills for a wide range of problems and data” (The University of Sydney 2023) with a focus on statistical methods to analyse and answer a scientific question.

1.1 Survey Method and Random Sampling

The raw dataset provided was sourced from a cohort survey which aimed to gain insight into the units’ cohort. Despite efforts to encourage student participation in the survey through an Ed Discussion Announcement and multiple reminders in labs and lectures, the response rate was 41%. It is important to note that, because of this method of communication, the survey participants may not have been a random sample of DATA2X02 students.

Students who were less engaged – possibly not attending lectures or labs, or not interacting with the Ed Discussion Board – are considerably less likely to have completed the survey than their counterparts who received multiple prompts. Moreover, those who are more engaged are more likely to take time out of their day to fill out the survey after a reminder. This is evidenced by DATA2902 (the advanced stream of DATA2X02) having a response rate of 71% compared to DATA2002’s rate of 37%. Students could also submit the survey multiple times, which may have skewed the data towards an individual if they submitted many different responses. Whilst acknowledging these shortcomings of the sampling method and subsequent response pattern, it is asserted that the survey still offers a moderately random sample of the DATA2X02 cohort.

1.2 Sources of Bias

There are some potential biases that may have occurred during this survey.

  • Non-response Bias – As discussed in Section 1.1, there may have been a non-response bias within the survey. Specifically, we see a difference in response rates between DATA2902 and DATA2002 students. This may have skewed the sample data towards the population of DATA2902 students, rather than DATA2X02 as a whole. This would be an issue if there is a significant difference between the populations of the two units. This is not out of the question, as those who opt to take an advanced stream of a unit may be more willing to challenge themselves and put more effort into their studies. Moreover, there is the possibility that students do not opt for an advanced unit in order to prioritise other aspects of their life, such as work.

  • Social desirability/conformity bias – Many of the questions asked in the survey have an associated ‘socially desirable’ answer. For example, students may, whether consciously or unconsciously, overestimate the number of hours they exercise, or underestimate the amount of time they spend on social media, as these answers come with positive social connotations. Moreover, students may want to conform to the expected answer of the population. An example of this may be the question on whether or not students had experience in R coding. The majority of the DATA2X02 cohort would have had experience in R as it was taught in many prerequisite courses, so those who didn’t have experience may answer incorrectly to conform with the rest of the cohort.

  • Recall Bias – Even if students did not suffer from social desirability or conformity bias, they may have simply not been able to recall the correct answer for a question. An example of this would be someone’s WAM. Many students may not know their actual WAM (as it is not reported when getting results or on the online academic transcript), and so they could incorrectly recall it when answering the survey. An instance of this is seen in the WAMs reported, with three students reporting a WAM of 99 or above, values that may be inaccurate due to difficulties in recall.

1.3 Possible Improvements

There are many possible improvements which would help to generate more useful data. Many of the questions regarding numeric data did not specify the units in which an answer should be given, or whether the units should be included in the answer. This can be addressed by specifying units in the question and only allowing numeric input rather than free text. One such question was “How much sleep do you get (on avg, per day)?”. A better wording would be “How much sleep (in hours per night) do you get on average?”. This was also an issue for the question “How tall are you?”, where answers were not given in a uniform manner. Rewording to “How tall are you in cm?” would have produced data which required much less cleaning. This extends to “What is your shoe size?”, where students responded with both US and European shoe sizes, which are on very different scales (a US 10 is a 43 European).

There were also issues regarding the categorical data. The question “Would you prefer to study at Fisher Library or SciTech Library?” did not need to include an Other response, as any answer of this type would not be answering the question asked. Moreover, the question “Do you work?” did not align with the suggested responses given. This question should have been “What is your current employment status?”. A similar issue was seen in the question “Do you submit assignments on time?”, which should have been “How often do you submit assignments on time?”. Finally, some questions could have included some options and an Other response, rather than free text. This was a particular issue for “What brand is your laptop?” and “What is your favourite social media platform?”, where students gave answers in many different forms when referring to the same category, e.g. Apple and Macbook being the same laptop brand. Providing some pre-defined answers would reduce the need for data cleaning.

1.4 Report Outline

This report will focus on the geographical characteristics of the cohort, with the postcode of each response being used as a proxy for where a student lives. Specifically, hypothesis testing will be used to determine the impact of a student’s geographical region on a variety of variables.

SA4s are the “largest sub-State regions” and “represent labour markets or groups of labour markets within each State and Territory” (Australian Bureau of Statistics 2021), with each SA4 having approximately 300,000 - 500,000 residents in metropolitan areas. These regions will be used to group students into geographical areas with ‘geographical, social and economic similarities’ (Australian Bureau of Statistics 2021). Figure 1 is a map made using Leaflet (Cheng, Karambelkar, and Xie 2023) which showcases the SA4s of Greater Sydney (see footnote 1).

Code
sa4_df <- st_read('Data/1270055001_sa4_2016_aust_shape')
sa4_df_filter <- sa4_df |> filter(GCC_NAME16 == 'Greater Sydney')
Code
p_popup <- paste0("<strong>Name: </strong>", sa4_df_filter$SA4_NAME16)

leaflet(sa4_df_filter) %>%
  addPolygons(
    popup = p_popup,
    fillColor = 'lightblue',
    opacity = 1.0,
    weight = 2,
    color = "darkblue",
    fillOpacity = 0.2) %>%
  addTiles()
Figure 1: Map of SA4s in Greater Sydney

1.5 Data Cleaning

Data cleaning was done in R (R Core Team 2023) using the tidyverse packages (Wickham et al. 2019). The janitor package (Firke 2023) was initially used to standardise the names of each column so that a reproducible introduction could be made. A new naming convention for the columns was then adopted based on Tarr (2023). Some summary tables have also been created using gt (Iannone et al. 2023).

Column Name Conversion Table
Code
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv')

old_names = colnames(raw_df)

df <- raw_df

new_names = c("timestamp","n_units","task_approach","age",
              "life","fass_unit","fass_major","novel",
              "library","private_health","sugar_days","rent",
              "post_code","haircut_days","laptop_brand",
              "urinal_position","stall_position","n_weetbix","food_budget",
              "pineapple","living_arrangements","height","uni_travel_method",
              "feel_anxious","study_hrs","work","social_media",
              "gender","sleep_time","diet","random_number",
              "steak_preference","dominant_hand","normal_advanced","exercise_hrs",
              "employment_hrs","on_time","used_r_before","team_role",
              "social_media_hrs","uni_year","sport","wam","shoe_size")
# overwrite the old names with the new names:
colnames(df) = new_names
# combine old and new into a data frame:
name_combo = bind_cols(`New Names` = new_names, `Original Names` = old_names)
name_combo |> gt::gt() |> gt::tab_header(title = "Column Name Cleaning") |> gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
Column Name Cleaning
New Names | Original Names
timestamp | Timestamp
n_units | How many units are you enrolled in this semester?
task_approach | When it comes to assignments / due tasks do you:
age | How old are you?
life | Do you tend to lean towards saying "yes" or towards saying "no" to things throughout life?
fass_unit | Have you taken one or more units of study from the Faculty of Arts and Social Sciences?
fass_major | Are you completing a major or minor in a subject area from the Faculty of Arts and Social Sciences?
novel | Have you read a novel this year?
library | Would you prefer to study at Fisher Library or SciTech Library?
private_health | Do you have private health insurance?
sugar_days | How many days in a week you normally consume sweets/chocolates/sugary drinks? (Exclude Diet/Sugar Free Drinks & sweets)?
rent | Do you pay rent?
post_code | What is your post code?
haircut_days | How many days do you go between haircuts (on average)?
laptop_brand | What brand is your laptop?
urinal_position | You enter a public bathroom and find you're the only one there. There are three urinals on the wall for you to choose from. Which do you choose?
stall_position | You enter a public bathroom and there are three stalls to choose from. All three are unoccupied. Which do you choose?
n_weetbix | How many Weet-Bix would you typically eat in one sitting?
food_budget | What is the average amount of money you spend each week on food/beverages?
pineapple | Do you like pineapple on pizza?
living_arrangements | What are your current living arrangements?
height | How tall are you?
uni_travel_method | How do you get to university?
feel_anxious | How often would you say you feel anxious on a daily basis?
study_hrs | How many hours a week do you spend studying?
work | Do you work?
social_media | What is your favourite social media platform?
gender | What is your gender?
sleep_time | How much sleep do you get (on avg, per day)?
diet | What is your diet style?
random_number | Pick a number at random between 0 and 9
steak_preference | How do you like your steak cooked?
dominant_hand | What is your dominant hand?
normal_advanced | Which unit are you enrolled in?
exercise_hrs | On average, how many hours each week do you spend exercising?
employment_hrs | How many hours a week (on average) do you work in paid employment?
on_time | Do you submit assignments on time?
used_r_before | Have you ever used R before starting DATA2x02?
team_role | What kind of role (active or passive) do you think you are when working as part of a team?
social_media_hrs | How many hours do you spend on social media per day?
uni_year | Which year of university are you currently in?
sport | Which sports do you play most often?
wam | What is your WAM?
shoe_size | What is your shoe size?

The SA4 name of each row was also joined onto the survey data using a reference table made by Proctor (2023).

Code
sa4_postcode_df <- readr::read_csv('Data/sa4_postcode.csv') |> 
  select(c(`Postcode`, `SA4 Name`)) |> 
  unique() |> 
  filter(!((`Postcode` == 2232) & (`SA4 Name` == 'Southern Highlands and Shoalhaven')))

colnames(sa4_postcode_df) <- c('post_code', 'sa4_name')

sa4_postcode_df$post_code <- as.character(sa4_postcode_df$post_code) 

df$post_code <- as.character(gsub("[^0-9]", "", df$post_code))

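# Join on the shared postcode column (made explicit to avoid relying on the default natural join)
df <- df |> left_join(sa4_postcode_df, by = "post_code")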

df |> count(sa4_name) |>
  arrange(desc(n)) |> 
  gt::gt() |> 
  gt::cols_label(sa4_name = "SA4 Name", n='Count of Students') |> 
  gt::tab_header(title = "Count of Students by SA4") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
Count of Students by SA4
SA4 Name | Count of Students
Sydney - City and Inner South | 123
Sydney - Inner West | 35
NA | 31
Sydney - North Sydney and Hornsby | 30
Sydney - Ryde | 18
Sydney - Inner South West | 16
Sydney - Parramatta | 14
Sydney - Northern Beaches | 11
Sydney - Eastern Suburbs | 9
Sydney - Blacktown | 6
Sydney - Outer West and Blue Mountains | 6
Sydney - South West | 5
Sydney - Baulkham Hills and Hawkesbury | 2
Sydney - Outer South West | 2
Sydney - Sutherland | 2
Central Coast | 1
Riverina | 1
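
The NA row in the table above corresponds to responses whose postcode was missing or found no match in the reference table. As a minimal sketch of how the cleaning and join steps behave, the toy example below uses made-up postcodes and placeholder region names; the `survey` and `lookup` tables and the names `Region A` and `Region B` are hypothetical and are not part of the survey data.

```r
library(dplyr)

# Hypothetical free-text postcode responses, including stray text and a missing value
survey <- tibble(post_code = c("2006", " 2050 ", "NSW 2000", NA, "3000"))

# Hypothetical lookup table standing in for the postcode-to-SA4 reference table
lookup <- tibble(
  post_code = c("2000", "2006", "2050"),
  sa4_name  = c("Region A", "Region A", "Region B")
)

# Strip every non-digit character, as in the cleaning step above (NA stays NA)
survey <- survey |> mutate(post_code = gsub("[^0-9]", "", post_code))

# Left join keeps every survey row; rows without a match get an NA sa4_name
joined <- survey |> left_join(lookup, by = "post_code")

# Rows that would be counted under NA in a summary table
unmatched <- joined |> filter(is.na(sa4_name))

joined
unmatched
```

In this sketch the missing response and the postcode absent from the lookup both end up with an NA region, which is why it can be worth inspecting the unmatched rows before grouping.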

The SA4s were further grouped together geographically to collapse some of the groups with lower student counts. Figure 2 is a map of the groupings of SA4s into regions. A conversion table was generated using flextable (Gohel and Skintzos 2023).

SA4 to Region Conversion Table
Code
north_sydney = c('Sydney - North Sydney and Hornsby', 'Sydney - Ryde', 'Sydney - Northern Beaches')
city_and_eastern_suburbs = c('Sydney - City and Inner South', 'Sydney - Eastern Suburbs')
inner_west = c('Sydney - Inner West', 'Sydney - Parramatta', 'Sydney - Inner South West')
outer_south_west = c('Sydney - Blacktown', 'Sydney - South West', 'Sydney - Sutherland', 'Sydney - Outer West and Blue Mountains', 'Sydney - Outer South West')
riverina_and_central_coast = c('Sydney - Baulkham Hills and Hawkesbury', 'Central Coast', 'Riverina')

df <- df |> 
  mutate(geographic_regions = case_when(
    sa4_name %in% north_sydney ~ 'North Sydney',
    sa4_name %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    sa4_name %in% inner_west ~ 'Inner West',
    !is.na(sa4_name) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA
  ))

mapping_df <- df |> select(geographic_regions, sa4_name) |> 
  unique() |> 
  drop_na() |> 
  arrange(geographic_regions) |>
  mutate(`Region` = geographic_regions, `SA4 Name`=sa4_name) |> 
  select(Region, `SA4 Name`)


flextable(mapping_df) |> merge_v() |> theme_vanilla() |> width(2, 4) |> width(1, 2)

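Region | SA4 Name
City and Eastern Suburbs | Sydney - City and Inner South
City and Eastern Suburbs | Sydney - Eastern Suburbs
Inner West | Sydney - Inner West
Inner West | Sydney - Inner South West
Inner West | Sydney - Parramatta
North Sydney | Sydney - North Sydney and Hornsby
North Sydney | Sydney - Ryde
North Sydney | Sydney - Northern Beaches
Outer South West, Greater Sydney and Regional NSW | Sydney - Blacktown
Outer South West, Greater Sydney and Regional NSW | Sydney - Outer West and Blue Mountains
Outer South West, Greater Sydney and Regional NSW | Sydney - South West
Outer South West, Greater Sydney and Regional NSW | Central Coast
Outer South West, Greater Sydney and Regional NSW | Riverina
Outer South West, Greater Sydney and Regional NSW | Sydney - Baulkham Hills and Hawkesbury
Outer South West, Greater Sydney and Regional NSW | Sydney - Sutherland
Outer South West, Greater Sydney and Regional NSW | Sydney - Outer South West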
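
Note that the grouping code above only tests membership of the first three vectors; every other non-missing SA4 falls through to the final catch-all region, so the `outer_south_west` and `riverina_and_central_coast` vectors are not actually consulted. A small sketch of that fall-through behaviour, using hypothetical placeholder SA4 names (`A`, `B`, `C`) rather than real ones:

```r
library(dplyr)

# Hypothetical group definition; only this vector is checked explicitly
north <- c("A", "B")

# Toy data: one SA4 in the named group, one outside it, one missing
toy <- tibble(sa4_name = c("A", "C", NA))

toy <- toy |>
  mutate(region = case_when(
    sa4_name %in% north ~ "North",            # first matching condition wins
    !is.na(sa4_name)    ~ "Everything else",  # catch-all for any other known SA4
    TRUE                ~ NA_character_       # missing SA4 stays missing
  ))

toy
```

Because conditions are evaluated in order, any SA4 added to the data later would silently join the catch-all region unless it is listed in one of the earlier vectors.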

Code
sa4_df_in_survey <- sa4_df |> filter(SA4_NAME16 %in% df$sa4_name)

sa4_df_in_survey <- sa4_df_in_survey |> 
  mutate(geographic_regions = case_when(
    SA4_NAME16 %in% north_sydney ~ 'North Sydney',
    SA4_NAME16 %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    SA4_NAME16 %in% inner_west ~ 'Inner West',
    !is.na(SA4_NAME16) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA
  ))

factpal <- colorFactor(c('darkblue', 'darkgreen', 'darkred', 'purple'), sa4_df_in_survey$geographic_regions)
p_popup <- paste0("<strong>Name: </strong>", sa4_df_in_survey$SA4_NAME16)

leaflet(sa4_df_in_survey) %>%
  addPolygons(
    popup = p_popup,
    fillColor = ~factpal(geographic_regions),
    opacity = 1.0,
    weight = 2,
    color = ~factpal(geographic_regions),
    fillOpacity = 0.1) %>%
  addTiles() %>% addLegend("bottomleft", pal = factpal, values = ~geographic_regions, title='Region')
Figure 2: Map of SA4s grouped into Regions for students in DATA2X02


A flagging column was made that identified if someone travelled by car.

Code
df <- df |>
  mutate(car_flag = ifelse(str_detect(uni_travel_method, "Car"), "Drive", ifelse(is.na(uni_travel_method), NA, "Other")))

df |> count(car_flag) |> 
  gt::gt() |> 
  gt::cols_label(car_flag = "Does the Student Drive to University?", n='Count of Students') |> 
  gt::tab_header(title = "Count of Students by Whether or Not they Travel by Car") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
Count of Students by Whether or Not they Travel by Car
Does the Student Drive to University? | Count of Students
Drive | 66
Other | 242
NA | 4


The employment hours of each respondent were binned into categories.

Code
bin_ranges <- c(0, 1, 10, Inf)
bin_labels <- c("0", "1-10","11+")

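# Create a new column with binned values.
# cut() intervals are right-closed, so the bins are [0, 1], (1, 10] and (10, Inf]:
# a response of exactly 1 hour is therefore labelled "0".
df$employment_hrs_bin <- cut(df$employment_hrs, breaks = bin_ranges, labels = bin_labels, include.lowest = TRUE)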

df |> count(employment_hrs_bin) |> 
  gt::gt() |> 
  gt::cols_label(employment_hrs_bin = "Employment Hours", n='Count of Students') |> 
  gt::tab_header(title = "Count of Students by Employment Hours") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
Count of Students by Employment Hours
Employment Hours | Count of Students
0 | 144
1-10 | 72
11+ | 79
NA | 17
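
To make the bin boundaries concrete, the sketch below applies the same breaks and labels to a handful of made-up hour values (the `hours` vector is hypothetical). It also shows one possible alternative set of breaks, offered only as an assumption about what the labels intend, in which any positive number of hours up to 10 falls in the "1-10" bin.

```r
# Same breaks and labels as the binning step above
bin_ranges <- c(0, 1, 10, Inf)
bin_labels <- c("0", "1-10", "11+")

# Hypothetical responses chosen to sit on and around the bin edges
hours <- c(0, 0.5, 1, 2, 10, 10.5, 11, 40)

# Right-closed intervals with include.lowest: [0, 1], (1, 10], (10, Inf]
binned <- cut(hours, breaks = bin_ranges, labels = bin_labels, include.lowest = TRUE)

# Alternative breaks (an assumption, not the report's method): (-Inf, 0], (0, 10], (10, Inf]
binned_alt <- cut(hours, breaks = c(-Inf, 0, 10, Inf), labels = bin_labels)

data.frame(hours, binned, binned_alt)
```

The two schemes differ only for responses above 0 and up to 1 hour, so the choice matters only if such responses are present in the data.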


Outliers of WAM were set to NA, as these may be international students who have a different WAM system or people who do not know their WAM. A value was judged to be an outlier if it lay outside \(\pm 3\) standard deviations of the mean.

Code
# Set values more than 3 standard deviations from the mean to NA
remove_outlier <- function(vec){
  centre <- mean(vec, na.rm = TRUE)
  spread <- sd(vec, na.rm = TRUE)
  # which() drops NA comparisons, so existing NAs are left untouched
  vec[which(abs(vec - centre) > 3 * spread)] <- NA
  return(vec)
}

df[['wam']] <- remove_outlier(df[['wam']])

df %>%
  
  ggplot(aes(x=wam)) + 
  
  geom_histogram(bins = 20, 
                 fill = "steelblue1", 
                 color = "black") + 
  
  labs(x="WAM", 
       y="Frequency", 
       title="Histogram of Students' WAM with Outliers Removed") + 
  
  theme(legend.position="none", 
        plot.background = element_rect(fill = "#ffffff", 
                                       linewidth = 0), 
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 13, 
                                  hjust = 0.5))

Figure 3: Histogram of students’ WAM with outliers removed
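
As a rough illustration of the 3 standard deviation rule, the sketch below applies an equivalent function to a made-up vector of marks (the `toy_wam` values and the name `flag_outliers` are hypothetical). The mean and standard deviation are computed with the extreme value still included, so in small samples a single extreme value inflates the standard deviation and may not be flagged.

```r
# Equivalent of the outlier rule described above: values beyond mean +/- 3 SD become NA
flag_outliers <- function(vec) {
  centre <- mean(vec, na.rm = TRUE)
  spread <- sd(vec, na.rm = TRUE)
  vec[which(abs(vec - centre) > 3 * spread)] <- NA
  vec
}

# Hypothetical marks: twenty plausible values and one implausibly low value
toy_wam <- c(rep(70, 10), rep(75, 10), 5)

cleaned <- flag_outliers(toy_wam)

# Only the final value is replaced; the other twenty are unchanged
cleaned
sum(is.na(cleaned))
```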

2 Hypothesis Testing

2.1 Does Living in Sydney’s City and Eastern Suburbs Influence if Students Drive to University?

Given the University of Sydney is located in Sydney’s City and Eastern Suburbs, it is suspected that students who live close to the university may opt for public transport rather than driving. This is of interest as the effective carbon emissions of the university can be reduced if more students use public transport.

Code
car_df <- df |> select(c(geographic_regions, car_flag)) |> mutate(geographic_regions =  ifelse(geographic_regions == 'City and Eastern Suburbs', 'City and Eastern Suburbs', 'Other')) |> drop_na() |> mutate(`Travel Method` = car_flag)

car_df |> ggplot() + aes(x=geographic_regions, fill=`Travel Method`) + geom_bar(colour = "black", #Creates a proportion bar chart
           linewidth = 0.5,
           position = "fill") + 
  labs(y="Proportion of Travel Method", #Changes the axis label and title
       x="Region", 
       title="Proportion of Students who Drive to University \n based on Geographical Location",
       fill="Travel Method") + 
  
  theme(plot.background = element_rect(fill = "#ffffff", #Changes the aesthetics of the chart
                                       linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", 
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 14, 
                                  hjust = 0.5)) + scale_y_continuous(labels = scales::percent) + scale_fill_brewer(palette = "Set2")

Figure 4: Proportion bar chart of travel method for different regions

A \(\chi^2\)-test for independence was performed at the \(\alpha = 0.05\) level on the below contingency table. A Monte Carlo simulation with \(6000\) replicates was used to calculate the \(p\)-value.

Code
contingency_table <- table(car_df$geographic_regions, car_df$car_flag) |> as.data.frame.matrix()

contingency_table$`Region` = c('City and Eastern Suburbs', 'Other')

contingency_table |> gt::gt() |> 
  gt::cols_move_to_start(columns=c(`Region`)) |> 
  gt::tab_spanner(label = "Method of Travel", columns = 1:2) |> 
  gt::tab_header(title = "Count of Students by Method of Travel") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
Count of Students by Method of Travel
Region | Method of Travel: Drive | Method of Travel: Other
City and Eastern Suburbs | 13 | 118
Other | 48 | 101
Code
set.seed(1)
test <- chisq.test(table(car_df$car_flag, car_df$geographic_regions), simulate.p.value=TRUE, B=6000)
\(\chi^2\)-test for independence
  1. Hypothesis – \(H_0\): The method of travel of a student is independent of whether they live in Sydney’s City and Eastern Suburbs. \(H_1\): There is some dependence between method of travel and living in Sydney’s City and Eastern Suburbs.

  2. Assumptions – The observations are independent, and the expected cell counts are all greater than or equal to 5. Independence is assumed on the basis that each student responded only once, although Section 1.1 notes that multiple submissions were possible. No expected cell counts were less than 5, so the expected-count assumption holds.

  3. Test Statistic – \[T = \sum_{i=1}^2 \sum_{j=1}^2 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}\] Under \(H_0\), \(T\sim \chi^2_1\) approximately.

  4. Observed Test Statistic – \(t_0=\) 20.3285.

  5. p-value – The proportion of simulated test statistics that were as or more extreme than \(t_0\) was \(p=\) 0.00017.

  6. Decision – As the \(p\)-value was \(<\alpha\), we reject \(H_0\). This implies that there is some dependence between method of travel and living in Sydney’s City and Eastern Suburbs.
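
The observed test statistic can be checked by hand from the contingency table. The sketch below recomputes the expected counts and the Pearson statistic from the four reported cell counts; it is a cross-check under the assumption that no continuity correction is applied when the \(p\)-value is simulated, not a rerun of the original analysis. It also computes the smallest \(p\)-value that 6000 replicates can return, assuming the usual \((1 + \text{count})/(B + 1)\) form of the simulated \(p\)-value.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(13, 118,
                     48, 101),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Region = c("City and Eastern Suburbs", "Other"),
                                   Travel = c("Drive", "Other")))

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic without continuity correction
t0 <- sum((observed - expected)^2 / expected)

# Smallest simulated p-value possible with B = 6000 replicates
min_p <- 1 / (6000 + 1)

round(expected, 2)
round(t0, 4)
round(min_p, 5)
```

The reported \(p\)-value of 0.00017 matches this minimum after rounding, which is consistent with none of the simulated statistics reaching \(t_0\).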

2.2 Is academic performance significantly better for students living in North Sydney compared to those in the Inner West?

A student’s WAM is one measure of academic performance. Knowing whether WAM is related to where students live could allow the University to provide targeted academic help.

Code
wam_df <- df |> filter(geographic_regions %in% c('Inner West', 'North Sydney')) |> 
  select(geographic_regions, wam) |> 
  drop_na()

inner_west_wam <- filter(wam_df, geographic_regions=="Inner West")$wam
north_sydney_wam <- filter(wam_df, geographic_regions=="North Sydney")$wam

A Welch two-sample one-sided \(t\)-test at the \(\alpha = 0.05\) level was conducted to determine if the mean WAM of students in North Sydney is larger than that of students living in the Inner West. Initial EDA suggests this may be the case, with mean WAMs of 76.4 and 74.1 respectively. QQ-plots of students’ WAM for each region were also generated; the points lie close to the reference line, which suggests WAM is approximately normally distributed in each group.

Code
wam_df |> ggplot() + aes(y=wam, color=geographic_regions, x=geographic_regions, fill=geographic_regions) + geom_boxplot()+ scale_color_manual(values = c('darkgreen','darkred')) + scale_fill_manual(values = c(rgb(217/255, 227/255, 215/255), rgb(230/255, 215/255, 214/255))) + theme(legend.position="none") + 
  theme(plot.background = element_rect(fill = "#ffffff", #Changes the aesthetics of the chart
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 14, 
                                  hjust = 0.5)) + labs(y="WAM", #Changes the axis label and title
       x="Region", 
       title="A: Grouped Box Plot of WAM by Region")

ggqqplot(wam_df, x = "wam", facet.by = "geographic_regions", color = "geographic_regions", palette=c('darkgreen','darkred'), legend='none', title="B: QQ-plot of WAM") + theme(plot.background = element_rect(fill = "#ffffff", #Changes the aesthetics of the chart
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 14, 
                                  hjust = 0.5))

test <- t.test(north_sydney_wam, inner_west_wam, alternative = 'greater')

shapiro1 <- shapiro.test((wam_df |> filter(geographic_regions == 'Inner West'))$wam)
shapiro2 <- shapiro.test((wam_df |> filter(geographic_regions == 'North Sydney'))$wam)

degrees_of_freedom <- test$parameter

Figure 5: A: Box plot of WAMs of students from the Inner West and North Sydney

Figure 6: B: QQ-plot of WAMs of students from the Inner West and North Sydney


Welch two-sample one-sided \(t\)-test
  1. Hypothesis – \(H_0\): The mean WAM of students from North Sydney \(\mu_{NS}\) equals the mean WAM of students from the Inner West \(\mu_{IW}\). \(H_1\): \(\mu_{NS}\) is greater than \(\mu_{IW}\).

  2. Assumptions – The observations in each group are independently and identically distributed as \(\mathcal{N}(\mu_{i}, \sigma_{i}^2)\) for \(i=NS, IW\), and the two groups are independent of each other. Independence is assumed on the basis that each student responded only once, although Section 1.1 notes that multiple submissions were possible. The above QQ-plot (Figure 6) suggests that WAM is approximately normally distributed in each group. A Shapiro-Wilk test of \(X\sim\mathcal{N}(\mu_{i}, \sigma_{i}^2)\) was also run on each group, with reported p values of 0 for Inner West and 0 for North Sydney. P values this small would ordinarily count as evidence against normality, so the normality assumption rests on the QQ-plot rather than on this test.

  3. Test Statistic – \[T=\frac{\overline{NS}-\overline{IW}}{\sqrt{\frac{S_{ns}^2}{n_{ns}}+\frac{S_{iw}^2}{n_{iw}}}}\] Here, \(S_{ns}^2\) and \(S_{iw}^2\) are the sample variances of the \(NS\) (North Sydney) and \(IW\) (Inner West) samples. Under \(H_0\), \(T\sim t_{\nu}\), where \(\nu=\) 104.47 as estimated from the data.

  4. Observed Test Statistic – \(t_0=\) 1.2399

  5. p-value – \(p = P\left(t_\nu \geq t_0\right)=\) 0.109

  6. Decision – As the \(p\)-value (0.109) was \(>\alpha\), we do not reject \(H_0\). The data do not provide sufficient evidence that the mean WAM of students from North Sydney is greater than that of students who live in the Inner West.
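
The test statistic and the estimated degrees of freedom \(\nu\) can both be computed by hand. The sketch below does so on simulated data (the `ns` and `iw` vectors are randomly generated placeholders, not the survey WAMs) using the Welch-Satterthwaite approximation for \(\nu\), and compares the results with `t.test()`.

```r
set.seed(42)

# Simulated placeholder samples; these are NOT the survey data
ns <- rnorm(50, mean = 76, sd = 8)
iw <- rnorm(60, mean = 74, sd = 10)

# Squared standard error contributed by each group
se2_ns <- var(ns) / length(ns)
se2_iw <- var(iw) / length(iw)

# Welch test statistic: difference in means over the combined standard error
t_manual <- (mean(ns) - mean(iw)) / sqrt(se2_ns + se2_iw)

# Welch-Satterthwaite approximation to the degrees of freedom
nu <- (se2_ns + se2_iw)^2 /
  (se2_ns^2 / (length(ns) - 1) + se2_iw^2 / (length(iw) - 1))

# One-sided p-value: P(t_nu >= t_manual)
p_manual <- pt(t_manual, df = nu, lower.tail = FALSE)

# Built-in Welch test for comparison (var.equal = FALSE is the default)
fit <- t.test(ns, iw, alternative = "greater")

c(t = t_manual, df = nu, p = p_manual)
fit
```

\(H_0\) is rejected only when the resulting \(p\)-value falls below \(\alpha\); otherwise the test is inconclusive rather than evidence that the means are equal.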

2.3 Does a student’s Region have a significant influence on how many hours they work?

Initial exploration of the data set suggested that the distribution of working hours differed across regions. The proportion of students working between one and 10 hours per week was relatively similar, and the main differences were observed in the proportions of students working no hours or 11 or more hours a week.
Code
employment_df <- df |> select(c(geographic_regions, employment_hrs_bin)) |> drop_na() |> mutate(`Employment Hours per Week` = employment_hrs_bin)

employment_df |> ggplot() + aes(x=geographic_regions, fill=`Employment Hours per Week`) + geom_bar(colour = "black", #Creates a proportion bar chart
           linewidth = 0.5,
           position = "fill") + 
  labs(y="Proportion of Hours Worked Category", #Changes the axis label and title
       x="Region", 
       title="Proportion of Students in Hours Worked by Region",
       fill="Employment Hours per Week") + 
  
  theme(plot.background = element_rect(fill = "#ffffff", #Changes the aesthetics of the chart
                                       linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", 
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 14, 
                                  hjust = 0.5)) + scale_y_continuous(labels = scales::percent) + scale_x_discrete(labels=c("City and \n Eastern Suburbs", "Inner West", "North Sydney", "Outer South West, \n Greater Sydney and \n Regional NSW")) + scale_fill_brewer(palette = "Set2")

Figure 7: Proportion bar chart of hours worked for different regions.

A \(\chi^2\)-test for independence was performed at the \(\alpha = 0.05\) level on the below contingency table. No continuity correction was applied, as chisq.test only uses Yates’s correction for \(2 \times 2\) tables.

Code
contingency_table <- table(employment_df$geographic_regions, employment_df$employment_hrs_bin) |> as.data.frame.matrix()

contingency_table$`Region` = c('City and Eastern Suburbs', 'Inner West', 'North Sydney', 'Outer South West,\n Greater Sydney and \n Regional NSW')

contingency_table |> gt::gt() |> 
  gt::cols_move_to_start(columns=c(`Region`)) |> 
  gt::tab_spanner(label = "Hours Worked", columns = 1:3) |> 
  gt::tab_header(title = "Count of Students by Hours Worked") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
Count of Students by Hours Worked
Region | Hours Worked: 0 | Hours Worked: 1-10 | Hours Worked: 11+
City and Eastern Suburbs | 82 | 27 | 19
Inner West | 34 | 15 | 14
North Sydney | 15 | 15 | 27
Outer South West, Greater Sydney and Regional NSW | 6 | 8 | 11
Code
test <- chisq.test(table(employment_df$geographic_regions, employment_df$employment_hrs_bin))

degrees_of_freedom <- test$parameter
\(\chi^2\)-test for independence
  1. Hypothesis – \(H_0\): The number of hours worked by a student is independent of their region. \(H_1\): There is some dependence between number of hours worked and region.

  2. Assumptions – The observations are independent, and the expected cell counts are all greater than or equal to 5. Independence is assumed on the basis that each student responded only once, although Section 1.1 notes that multiple submissions were possible. No expected cell counts were less than 5, so the expected-count assumption holds.

  3. Test Statistic – \[T = \sum_{i=1}^4 \sum_{j=1}^3 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}\] where \(i\) indexes the four regions and \(j\) the three hours-worked categories. Under \(H_0\), \(T\sim \chi^2_{6}\) approximately.

  4. Observed Test Statistic – \(t_0=\) 35.8235

  5. p-value – \(p=P(\chi^2_{6} \geq t_0)<0.001\).

  6. Decision – As the \(p\)-value was \(<\alpha\), we reject \(H_0\). This implies that there is some dependence between hours worked in a week and a student’s region.
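
As a cross-check, the test can be rerun directly from the twelve cell counts reported in the contingency table. The sketch below does this and also confirms that the `correct` argument of `chisq.test()` makes no difference here, since the continuity correction only applies to \(2 \times 2\) tables.

```r
# Observed counts copied from the contingency table above (rows = regions)
observed <- matrix(c(82, 27, 19,
                     34, 15, 14,
                     15, 15, 27,
                      6,  8, 11),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(
                     Region = c("City and Eastern Suburbs", "Inner West",
                                "North Sydney", "Outer and Regional"),
                     Hours  = c("0", "1-10", "11+")))

# Default call, as used in the analysis
fit <- chisq.test(observed)

# Explicitly disabling the continuity correction gives the same statistic
fit_uncorrected <- chisq.test(observed, correct = FALSE)

# Degrees of freedom: (rows - 1) * (columns - 1) = 3 * 2 = 6
fit$statistic
fit$parameter
fit$p.value

# Smallest expected count, for checking the expected-count assumption
min(fit$expected)
```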

3 Conclusion

The geographic characteristics have been investigated during this report by grouping DATA2X02 students into regions and performing hypothesis tests on differing variables.

Throughout the analysis, it was seen that geographic region played a statistically significant role in the distribution of Travel Method and Employment Hours per Week, but not WAM. Specifically, it was found that the method of travel of students is dependent on whether or not they live in Sydney’s City and Eastern Suburbs, and that employment hours per week are dependent on region. Although the sample mean WAM of students from North Sydney was higher than that of students living in the Inner West, the difference was not statistically significant (\(p=\) 0.109).

Future investigation into DATA2X02 cohorts may look to validate these results (to see if they are consistent across all DATA2X02 cohorts or specific to the 2023 cohort), as well as source more specific geographical information about students, rather than using their postcode.

References

Australian Bureau of Statistics. 2021. “Statistical Area Level 4.” 2021. https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/main-structure-and-greater-capital-city-statistical-areas/statistical-area-level-4.
Australian Statistical Geography Standard. 2016. “Main Structure and Greater Capital City Statistical Areas.” 2016. https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1270.0.55.001July%202016?OpenDocument.
Cheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2023. Leaflet: Create Interactive Web Maps with the JavaScript ’Leaflet’ Library. https://CRAN.R-project.org/package=leaflet.
Firke, Sam. 2023. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Gohel, David, and Panagiotis Skintzos. 2023. Flextable: Functions for Tabular Reporting. https://CRAN.R-project.org/package=flextable.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2023. Gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.
Proctor, Matthew. 2023. “Postcodes in New South Wales (NSW).” 2023. https://www.matthewproctor.com/full_australian_postcodes_nsw.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Tarr, Garth. 2023. “DATA2002 Assignment: Data Importing and Cleaning Guide.” 2023. https://pages.github.sydney.edu.au/DATA2002/2023/assignment/assignment_data.html.
The University of Sydney. 2023. “DATA2902.” 2023. https://www.sydney.edu.au/units/DATA2902.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Footnotes

  1. Shape files used in this map are available from Australian Statistical Geography Standard (2016).